University of Cologne
April 18, 2024
getwd(), file.path(), basename(), dirname(), dir.exists(), file.exists(), dir.create()
~
├── Desktop
│   ├── file.txt
│   ├── subfolder
│   │   ├── file2.txt
│   │   └── file3.txt
│   └── another_subfolder
│       └── file4.txt
...
~ represents your home directory
Paths can be absolute or relative
- absolute: /home/hlicht/Desktop/file.txt
- relative: subfolder/file2.txt (when ~/Desktop is your working directory)
getwd()
getwd() retrieves the current working directory
[1] "/Users/hlicht/Dropbox/teaching/text_wrangling_in_r/slides"
Note: These slides are created with quarto, which always sets the working directory to the folder that contains the .qmd file. Hence, we are in the slides/ folder.
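The path helpers listed above can be tried directly in the console. A minimal sketch (the path components are illustrative; tempdir() is used so the example has no side effects elsewhere):

```r
# build a platform-independent path from components
fp <- file.path(tempdir(), "subfolder", "file2.txt")

# split a path into its file and directory parts
basename(fp)  # "file2.txt"
dirname(fp)   # the ".../subfolder" part

# create the directory if it does not exist yet
if (!dir.exists(dirname(fp))) {
  dir.create(dirname(fp), recursive = TRUE)
}
dir.exists(dirname(fp))  # TRUE
```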
Option 1 👉
Option 2 👉
Select an existing R project in RStudio
file.path
file.path() constructs paths using .Platform$file.sep for cross-platform compatibility
basename and dirname
basename() for obtaining the file name from a path
dirname() for obtaining the directory part of a path
dir.exists and file.exists
dir.exists() for directories, file.exists() for files
dir.create and unlink
readr::read_csv for reading comma-separated files (CSV) with extension ".csv"
readr::read_tsv for reading tab-separated files (TSV) with extension ".tsv"
readr::read_delim for custom delimiters (e.g., ";" for semicolon-separated files)
readxl::read_excel to read Excel files
officer::read_docx to read Word documents
library(officer)
# read Word document
fp <- file.path("..", "data", "files", "test_file.docx")
doc <- read_docx(fp)
content <- docx_summary(doc)
content
  doc_index content_type style_name
1 1 paragraph Title
2 2 paragraph Author
3 3 paragraph First Paragraph
text
1 Test file
2 Hauke Licht
3 This is just a text document for illustrating how to read word and PDF files into R.
level num_id
1 NA NA
2 NA NA
3 NA NA
pdftools::pdf_text to read PDF files
jsonlite::read_json to read JSON files
library(jsonlite)
# read JSON file
fp <- file.path("..", "data", "nontabular", "test.json")
data <- read_json(fp)
data
$null_field
NULL
$logical_field
[1] TRUE
$numeric_field
[1] 1
$string_field
[1] "a value"
$list_field
$list_field[[1]]
[1] "a"
$list_field[[2]]
[1] "list"
$list_field[[3]]
[1] "of"
$list_field[[4]]
[1] "values"
$dictionary_field
$dictionary_field$subfield1
[1] "another value"
$dictionary_field$subfield2
$dictionary_field$subfield2[[1]]
[1] "a"
$dictionary_field$subfield2[[2]]
[1] "list"
$dictionary_field$subfield2[[3]]
[1] "of"
$dictionary_field$subfield2[[4]]
[1] "subvalues"
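The same parsed-list structure can be reproduced from a JSON string, which makes it easy to experiment without a file. A minimal sketch (the string is an abridged stand-in for the example file's contents):

```r
library(jsonlite)

# parse JSON from a string; simplifyVector = FALSE keeps the nested list
# structure that read_json() also returns by default
data <- fromJSON(
  '{"string_field": "a value", "list_field": ["a", "list", "of", "values"]}',
  simplifyVector = FALSE
)

# fields are accessed like any named R list
data$string_field     # "a value"
data$list_field[[2]]  # "list"
```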
readr::read_lines, purrr::map, jsonlite::fromJSON to read JSON lines (.jsonl) files
library(readr)
library(purrr)
library(jsonlite)
# read JSON lines and convert
fp <- file.path("..", "data", "nontabular", "test.jsonl")
lines <- read_lines(fp)
data <- map(lines, fromJSON)
data
[[1]]
[[1]]$id
[1] "001"
[[1]]$text
[1] "I'm sorry, I don't understand. Can you try again?"
[[2]]
[[2]]$id
[1] "002"
[[2]]$text
[1] "What is the average length of an elephant's ear?"
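Once each line is parsed, flat records like these can be stacked into a single tibble. A sketch using dplyr::bind_rows (the strings below stand in for the file's lines):

```r
library(purrr)
library(dplyr)
library(jsonlite)

# two JSON lines, standing in for the contents of a .jsonl file
lines <- c(
  '{"id": "001", "text": "first message"}',
  '{"id": "002", "text": "second message"}'
)

# parse each line, then stack the flat records row-wise
data <- map(lines, fromJSON)
df <- bind_rows(data)
nrow(df)  # 2
```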
xml2::read_xml to read XML files
library(xml2)
# read XML file
fp <- file.path("..", "data", "files", "example.xml")
data <- read_xml(fp)
data
{xml_document}
<library>
[1] <book id="1">\n <title>The Great Gatsby</title>\n <author>F. Scott Fitz ...
[2] <book id="2">\n <title>To Kill a Mockingbird</title>\n <author>Harper L ...
[3] <book id="3">\n <title>1984</title>\n <author>George Orwell</author>\n ...
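Individual elements can then be extracted with XPath queries. A minimal sketch on an inline document mirroring the example file's structure:

```r
library(xml2)

# an inline XML document with the same structure as the example file
doc <- read_xml(
  '<library><book id="1"><title>The Great Gatsby</title></book></library>'
)

# XPath locates nodes; xml_text() and xml_attr() extract their content
xml_text(xml_find_all(doc, "//book/title"))  # "The Great Gatsby"
xml_attr(xml_find_all(doc, "//book"), "id")  # "1"
```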
xml2::read_html to read HTML content
library(xml2)
# read HTML file
fp <- file.path("..", "data", "files", "example.html")
data <- read_html(fp)
data
{html_document}
<html lang="en">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n <h1>Library Catalog</h1>\n <table>\n<thead><tr>\n<th>ID</t ...
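The same XPath tooling works on parsed HTML. A sketch on an inline page mirroring the example file:

```r
library(xml2)

# an inline HTML page with a structure like the example file's
page <- read_html(
  "<html><body><h1>Library Catalog</h1><p>Some content</p></body></html>"
)

# extract the heading text with an XPath query
xml_text(xml_find_first(page, "//h1"))  # "Library Catalog"
```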
Many commonly used political (text) datasets are available online
For replicability and version control purposes, it’s a best practice to program the download of these data (instead of manually downloading and saving them)
Many replication materials for articles published in political science journals are available through the Harvard Dataverse:
Many journals have their own "dataverses". Here are some:
IMPORTANT: In the URLs listed above, the part after the last "/" is called the "Dataverse ID" – we need this to automatically download files from a journal's dataverse
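One convenient option for programmatic downloads is the dataverse R package. A sketch (the dataset DOI is the Bestvater & Monroe repository discussed below; the file name passed to get_dataframe_by_name() is hypothetical, and the reader function depends on the file type):

```r
library(dataverse)
library(readr)

# list all files in a dataset, identified by its persistent DOI
files <- dataset_files(
  "doi:10.7910/DVN/MUYYG4",
  server = "dataverse.harvard.edu"
)

# download a single file by name as a data frame
# (file name below is hypothetical; pick one from `files`)
df <- get_dataframe_by_name(
  "example_file.tab",
  dataset = "doi:10.7910/DVN/MUYYG4",
  server  = "dataverse.harvard.edu",
  .f      = read_tsv
)
```

This requires network access; the server argument can also be set once via the DATAVERSE_SERVER environment variable.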
We will use the replication data of the article
Bestvater, S., & Monroe, B. (2023). Sentiment is Not Stance: Target-Aware Opinion Classification for Political Text Analysis. Political Analysis, 31(2), 235-256.
The repository is https://doi.org/10.7910/DVN/MUYYG4
We will use the replication data for the article
Barberá, P., Boydstun, A. E., Linn, S., McMahon, R., & Nagler, J. (2021). Automated Text Classification of News Articles: A Practical Guide. Political Analysis, 29(1), 19–42.
The repository is https://doi.org/10.7910/DVN/MXKRDE
We will use the example of the article
van Atteveldt, W., van der Velden, M. A. C. G. & Boukes, M. (2021) The Validity of Sentiment Analysis: Comparing Manual Annotation, Crowd-Coding, Dictionary Approaches, and Machine Learning Algorithms. Communication Methods and Measures, 15(2), 121-140.
The repository is https://github.com/vanatteveldt/ecosent
sentences_df <- read_csv(gold_sentences_texts_url)
labels_df <- read_csv(gold_sentences_labels_url)
colnames(sentences_df)
[1] "id"            "headline"      "google"        "deepl"
[5] "dutch_lemmas"  "google_lemmas" "deepl_lemmas"
colnames(labels_df)
[1] "id"    "value"
Note: we use read_csv because the file we want to download is a .csv file
library(dplyr)
# compute number of labels per headline
labels_df |>
group_by(id) |>
summarise(n_labels = n()) |>
# tabulate how many headlines have each number of labels
count(n_labels)
# A tibble: 1 × 2
n_labels n
<int> <int>
1 1 284
Note: each of the 284 headlines has exactly one label
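Since there is exactly one label per headline, the two tables can be combined by id. A sketch with small stand-in tibbles (the real sentences_df and labels_df are downloaded above):

```r
library(dplyr)

# small stand-ins for the downloaded sentences and labels tables
sentences_df <- tibble(id = c(10, 11), headline = c("first headline", "second headline"))
labels_df    <- tibble(id = c(10, 11), value = c(0, 1))

# a left join keeps all headlines and attaches their single label
gold_df <- left_join(sentences_df, labels_df, by = "id")
names(gold_df)  # "id" "headline" "value"
```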
We can download ZIP archives and extract selected files in R
We will use the replication data for the article
Benoit, K., Conway, D., Lauderdale, B. E., Laver, M., & Mikhaylov, S. (2016). Crowd-sourced Text Analysis: Reproducible and Agile Production of Political Data. American Political Science Review, 110(2), 278–295.
The Github repository is here https://github.com/kbenoit/CSTA-APSR
url <- "https://github.com/kbenoit/CSTA-APSR/raw/4bc6cbc48a4eeff557cbb03b8ede73b29e36aa00/Data%20-%20CF%20jobs/CFjobresults.zip"
# download
temp <- tempfile()
download.file(url, temp, quiet = TRUE)
# list contents
conts <- unzip(temp, list = TRUE)$Name
head(conts, 4)
[1] "f240807.csv" "f246554.csv" "f246609.csv" "f263548.csv"
# A tibble: 2,971 × 28
`_unit_id` `_created_at` `_golden` `_canary` `_id` `_missed` `_started_at`
<dbl> <chr> <lgl> <lgl> <dbl> <lgl> <chr>
1 323988547 9/25/2013 01:2… FALSE NA 1.06e9 NA 9/25/2013 01…
2 323988547 9/25/2013 11:3… FALSE NA 1.06e9 NA 9/25/2013 11…
3 323988547 9/26/2013 04:4… FALSE NA 1.06e9 NA 9/26/2013 04…
4 323988547 9/26/2013 16:0… FALSE NA 1.06e9 NA 9/26/2013 16…
5 323988547 9/26/2013 16:2… FALSE NA 1.06e9 NA 9/26/2013 16…
6 323988548 9/25/2013 11:2… FALSE NA 1.06e9 NA 9/25/2013 11…
7 323988548 9/25/2013 12:0… FALSE NA 1.06e9 NA 9/25/2013 12…
8 323988548 9/26/2013 03:5… FALSE NA 1.06e9 NA 9/26/2013 03…
9 323988548 9/26/2013 16:1… FALSE NA 1.06e9 NA 9/26/2013 16…
10 323988548 9/26/2013 19:0… FALSE NA 1.06e9 NA 9/26/2013 19…
# ℹ 2,961 more rows
# ℹ 21 more variables: `_tainted` <lgl>, `_channel` <chr>, `_trust` <dbl>,
# `_worker_id` <dbl>, `_country` <chr>, `_region` <chr>, `_city` <chr>,
# `_ip` <chr>, econ_scale <dbl>, econ_scale2 <lgl>, policy_area <dbl>,
# soc_scale <dbl>, soc_scale2 <lgl>, orig__golden <lgl>,
# econ_scale_gold <lgl>, policy_area_gold <dbl>, post_sentence <chr>,
# pre_sentence <chr>, sentence_text <chr>, sentenceid <dbl>, …
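Continuing the download chunk above (this sketch reuses the temp and conts objects defined there), a single CSV can be read straight from the archive via unz(), or selected files extracted with unzip():

```r
library(readr)

# unz() opens a read connection to one file inside the archive, so we can
# read a CSV without extracting the whole ZIP
results <- read_csv(unz(temp, conts[1]))

# alternatively, extract only selected files to a target folder
out_dir <- file.path(tempdir(), "cfjobs")
unzip(temp, files = conts[1], exdir = out_dir)

# remove the temporary archive when done
unlink(temp)
```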